8-5 DTW for Speaker Identification

One typical application of DTW is text-dependent speaker identification, in which each speaker is recognized from a spoken password. The application is divided into two stages:

  1. At the registration stage, each speaker pronounces several utterances that serve as spoken passwords.
  2. At the application stage, the speaker pronounces one of the spoken passwords, and the system identifies the speaker by comparing this utterance against the passwords recorded at the registration stage. The comparison is usually done by DTW because of its robustness to variations in speech rate.
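
The comparison step can be illustrated with a minimal DTW sketch in MATLAB. It assumes 1-D toy sequences in place of real feature matrices, the Euclidean distance between frames, and the basic 0-45-90 degree local path constraint with both ends anchored. The DTW functions in the toolbox may use different constraints and are not reproduced here; all variable names are illustrative.

```matlab
% Minimal DTW sketch (illustrative; not the toolbox implementation).
% a and b hold one feature vector per column; here they are 1-D toy
% sequences, where b is a "spoken" at half the speed.
a = [1 2 3];
b = [1 1 2 2 3 3];
m = size(a, 2);
n = size(b, 2);

% D(i+1, j+1) = minimum accumulated distance for aligning a(:,1:i) with b(:,1:j)
D = inf(m+1, n+1);
D(1, 1) = 0;
for i = 1:m
    for j = 1:n
        localDist = norm(a(:, i) - b(:, j));   % Euclidean distance between frames
        % Allowed predecessors: (i-1, j), (i, j-1), (i-1, j-1)
        D(i+1, j+1) = localDist + min([D(i, j+1), D(i+1, j), D(i, j)]);
    end
end
dtwDist = D(m+1, n+1);
fprintf('DTW distance = %g\n', dtwDist);   % prints: DTW distance = 0
```

Because b only repeats the frames of a, the alignment absorbs the change in speaking rate and the distance is 0, whereas a frame-by-frame comparison is not even defined for sequences of different lengths.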

As an example, the "dataSet" directory of the ML toolbox contains recordings from two sessions, each with 3 subjects. Each subject was requested to pronounce 10 spoken passwords 3 times in each session, so each subject's folder within a session contains 30 recordings.

First of all, it is good programming practice to put all parameters related to the task into a single function that returns a structure variable containing them:

Example 1: speakerIdTextDependent/sidPrmSet.m

    function sidPrm=sidPrmSet
    % sidPrmSet: Set parameters for speaker identification

    % ====== Add required toolboxes to the search path
    mltPath='/users/jang/matlab/toolbox/dcpr';
    addpath(mltPath);
    % ====== Wave directories
    sidPrm.waveDir01=sprintf('%s\dataSet\speakerIdTextDependent\session01', mltRoot);
    sidPrm.waveDir02=sprintf('%s\dataSet\speakerIdTextDependent\session02', mltRoot);
    sidPrm.feaType='mfcc';      % 'mfcc', 'volume', 'pitch'
    sidPrm.outputDir='output';  % Output directory

Note that the above function also puts required toolboxes into the MATLAB search path.
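
The directory names above hard-code the backslash as the path separator. As a possible alternative, the following sketch builds the same kind of path with MATLAB's fullfile, which inserts the separator of the current platform. The variable rootDir is a stand-in for the toolbox root directory and is an assumption made for illustration.

```matlab
% Sketch: platform-neutral construction of the wave directories.
% rootDir is a hypothetical stand-in for the toolbox root directory.
rootDir = '/users/jang/matlab/toolbox/dcpr';
sidPrm.waveDir01 = fullfile(rootDir, 'dataSet', 'speakerIdTextDependent', 'session01');
sidPrm.waveDir02 = fullfile(rootDir, 'dataSet', 'speakerIdTextDependent', 'session02');
disp(sidPrm.waveDir01);
```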

To read the data from the folder, see the next example:

Example 2: speakerIdTextDependent/goFeaExtract.m

    % Feature extraction
    sidPrm=sidPrmSet;
    % ====== Read session 1
    speakerData1=speakerDataRead(sidPrm.waveDir01);
    fprintf('Get wave info of %d persons from %s\n', length(speakerData1), sidPrm.waveDir01);
    speakerData1=speakerDataAddFea(speakerData1, sidPrm);   % Add features to speakerData1
    % ====== Read session 2
    speakerData2=speakerDataRead(sidPrm.waveDir02);
    fprintf('Get wave info of %d persons from %s\n', length(speakerData2), sidPrm.waveDir02);
    speakerData2=speakerDataAddFea(speakerData2, sidPrm);   % Add features to speakerData2
    fprintf('Save speakerData1 and speakerData2 to speakerData.mat\n');
    save speakerData speakerData1 speakerData2

Output:

    Get wave info of 3 persons from \users\jang\matlab\toolbox\dcpr\dataSet\speakerIdTextDependent\session01
    1/3: Feature extraction from 30 recordings by 9761215 ===> 0.444344 sec
    2/3: Feature extraction from 30 recordings by 9761217 ===> 0.264503 sec
    3/3: Feature extraction from 30 recordings by 9762115 ===> 0.348345 sec
    Get wave info of 3 persons from \users\jang\matlab\toolbox\dcpr\dataSet\speakerIdTextDependent\session02
    1/3: Feature extraction from 30 recordings by 9761215 ===> 0.353603 sec
    2/3: Feature extraction from 30 recordings by 9761217 ===> 0.349893 sec
    3/3: Feature extraction from 30 recordings by 9762115 ===> 0.311077 sec
    Save speakerData1 and speakerData2 to speakerData.mat

To check data consistency, see the next example:

Example 3: speakerIdTextDependent/goDataCheck.m

    % Check data consistency for speaker identification.
    load speakerData.mat
    sidPrm=sidPrmSet;
    % ====== Check empty speaker folder in session01
    sentenceNum=[speakerData1.sentenceNum];
    index=find(sentenceNum==0);
    speakerData1Empty=speakerData1(index);
    title='Speakers with no recordings in session 1';
    outputFile=sprintf('%s/%s.htm', sidPrm.outputDir, title);
    if ~isempty(speakerData1Empty), structDispInHtml(speakerData1Empty, title, {'name'}, [], [], outputFile); end
    speakerData1(index)=[];
    % ====== Check empty speaker folder in session02
    sentenceNum=[speakerData2.sentenceNum];
    index=find(sentenceNum==0);
    speakerData2Empty=speakerData2(index);
    title='Speakers with no recordings in session 2';
    outputFile=sprintf('%s/%s.htm', sidPrm.outputDir, title);
    if ~isempty(speakerData2Empty), structDispInHtml(speakerData2Empty, title, {'name'}, [], [], outputFile); end
    speakerData2(index)=[];
    % ====== Speaker difference in both sessions
    speaker1={speakerData1.name};
    speaker2={speakerData2.name};
    diffSet1=setdiff(speaker1, speaker2);
    diffSet2=setdiff(speaker2, speaker1);
    % === Speaker in session01 but not in session02
    index1=[];
    for i=1:length(diffSet1)
        index1=[index1, find(strcmp(diffSet1{i}, speaker1))];
    end
    title='Speakers only in session 1';
    outputFile=sprintf('%s/%s.htm', sidPrm.outputDir, title);
    if ~isempty(index1), structDispInHtml(speakerData1(index1), title, {'name'}, [], [], outputFile); end
    speakerData1(index1)=[];
    % === Speaker in session02 but not in session01
    index2=[];
    for i=1:length(diffSet2)
        index2=[index2, find(strcmp(diffSet2{i}, speaker2))];
    end
    title='Speakers only in session 2';
    outputFile=sprintf('%s/%s.htm', sidPrm.outputDir, title);
    if ~isempty(index2), structDispInHtml(speakerData2(index2), title, {'name'}, [], [], outputFile); end
    speakerData2(index2)=[];
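
The speaker-difference step can be tried on toy data. The sketch below uses ismember to obtain logical masks in place of the setdiff-and-loop combination in the listing; on this toy data it removes the same entries. The last ID in speaker1 is made up so that the two sessions differ.

```matlab
% Sketch: find speakers that appear in only one session (toy names).
speaker1 = {'9761215', '9761217', '9762115', '0000000'};   % '0000000' is a made-up ID
speaker2 = {'9761215', '9761217', '9762115'};
onlyIn1 = ~ismember(speaker1, speaker2);   % logical mask over speaker1
onlyIn2 = ~ismember(speaker2, speaker1);   % logical mask over speaker2
fprintf('Only in session 1: %s\n', speaker1{onlyIn1});
% Removing the unmatched entries leaves identical speaker sets
speaker1(onlyIn1) = [];
speaker2(onlyIn2) = [];
```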

To evaluate the performance using DTW, try the next example:

Example 4: speakerIdTextDependent/goPerfEval.m

    % Performance evaluation
    load speakerData.mat
    % ====== Speaker ID by DTW
    for i=1:length(speakerData2)
        tInit=clock;
        name=speakerData2(i).name;
        fprintf('%d/%d: speaker=%s\n', i, length(speakerData2), name);
        for j=1:length(speakerData2(i).sentence)
        %   fprintf('\tsentence=%d ==> ', j);
        %   t0=clock;
            inputSentence=speakerData2(i).sentence(j);
            [speakerIndex, sentenceIndex, minDistance]=speakerId(inputSentence, speakerData1);
            computedName=speakerData1(speakerIndex).name;
        %   fprintf('computedName=%s, time=%.2f sec\n', computedName, etime(clock, t0));
            speakerData2(i).sentence(j).correct=strcmp(name, computedName);
            speakerData2(i).sentence(j).computedSpeakerIndex=speakerIndex;
            speakerData2(i).sentence(j).computedSentenceIndex=sentenceIndex;
            speakerData2(i).sentence(j).computedSentencePath=speakerData1(speakerIndex).sentence(sentenceIndex).path;
        end
        speakerData2(i).correct=[speakerData2(i).sentence.correct];
        speakerData2(i).rr=sum(speakerData2(i).correct)/length(speakerData2(i).correct);
        fprintf('\tRR for %s = %.2f%%, ave. time = %.2f sec\n', name, 100*speakerData2(i).rr, etime(clock, tInit)/length(speakerData2(i).sentence));
    end
    correct=[speakerData2.correct];
    overallRr=sum(correct)/length(correct);
    fprintf('Overall RR = %.2f%%\n', 100*overallRr);
    fprintf('Save speakerData1 and speakerData2 to speakerData.mat\n');
    save speakerData speakerData1 speakerData2

Output:

    1/3: speaker=9761215
        RR for 9761215 = 80.00%, ave. time = 0.05 sec
    2/3: speaker=9761217
        RR for 9761217 = 100.00%, ave. time = 0.03 sec
    3/3: speaker=9762115
        RR for 9762115 = 100.00%, ave. time = 0.05 sec
    Overall RR = 93.33%
    Save speakerData1 and speakerData2 to speakerData.mat
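
The function speakerId is not listed here. One plausible reading, sketched below, is a nearest-neighbor search: the test utterance is compared against every enrolled utterance, and the speaker of the closest one is returned. The field name fea, the toy features, and the stand-in distance function are assumptions made for illustration; in practice the distance would be a DTW distance.

```matlab
% Sketch of a nearest-neighbor speaker search (not the toolbox's speakerId).
% Toy enrollment data: two speakers, two utterances each.
% The field name "fea" is an assumption made for this illustration.
enrolled(1).name = 'A';
enrolled(1).sentence = struct('fea', {[1 2 3], [3 2 1]});
enrolled(2).name = 'B';
enrolled(2).sentence = struct('fea', {[5 6 7], [7 6 5]});
inputFea = [5 6 8];                    % features of the test utterance
distFcn = @(x, y) sum(abs(x - y));     % stand-in for a DTW distance

speakerIndex = 0;
sentenceIndex = 0;
minDistance = inf;
for i = 1:length(enrolled)
    for j = 1:length(enrolled(i).sentence)
        d = distFcn(inputFea, enrolled(i).sentence(j).fea);
        if d < minDistance             % keep the closest enrolled utterance so far
            minDistance = d;
            speakerIndex = i;
            sentenceIndex = j;
        end
    end
end
fprintf('Closest: speaker %s, sentence %d, distance %g\n', ...
    enrolled(speakerIndex).name, sentenceIndex, minDistance);
% prints: Closest: speaker B, sentence 1, distance 1
```

The cost of such a search grows with the number of enrolled utterances, since every test utterance is compared against all of them.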

After obtaining the overall recognition rate, we can compute statistics of each person, and also list the misclassified utterances with their false output, as shown in the following example:

Example 5: speakerIdTextDependent/goPostAnalysis.m

    sidPrm=sidPrmSet;
    load speakerData.mat
    correct=[speakerData2.correct];
    overallRr=sum(correct)/length(correct);
    % ====== Display each person's performance
    [junk, index]=sort([speakerData2.rr]);
    sortedSpeakerData2=speakerData2(index);
    outputFile=sprintf('%s/personRr_rr=%f%%.htm', sidPrm.outputDir, 100*overallRr);
    structDispInHtml(sortedSpeakerData2, sprintf('Performance of all persons (Overall RR=%.2f%%)', 100*overallRr), {'name', 'rr'}, [], [], outputFile);
    % ====== Display misclassified utterances
    sentenceData=[sortedSpeakerData2.sentence];
    sentenceDataMisclassified=sentenceData(~[sentenceData.correct]);
    outputFile=sprintf('%s/sentenceMisclassified_rr=%f%%.htm', sidPrm.outputDir, 100*overallRr);
    structDispInHtml(sentenceDataMisclassified, sprintf('Misclassified Sentences (Overall RR=%.2f%%)', 100*overallRr), {'path', 'computedSentencePath'}, [], [], outputFile);

This step is important for error analysis, which in turn guides further improvement of the classification system.
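
The rates in the output of Example 4 can be reproduced from per-utterance correctness flags. The sketch below uses toy flags chosen to match the reported rates (24 of 30 correct for the first speaker, all correct for the other two). Which utterances were actually misclassified is not known from the output, so the positions of the errors are arbitrary.

```matlab
% Sketch: per-speaker and overall recognition rates from correctness flags.
names = {'9761215', '9761217', '9762115'};
correct = {[true(1, 24), false(1, 6)], true(1, 30), true(1, 30)};   % toy flags
rr = cellfun(@mean, correct);          % per-speaker recognition rate
allCorrect = [correct{:}];             % pool all utterances
overallRr = sum(allCorrect) / length(allCorrect);
[sortedRr, index] = sort(rr);          % worst speaker first
for k = index
    fprintf('RR for %s = %.2f%%\n', names{k}, 100*rr(k));
end
fprintf('Overall RR = %.2f%%\n', 100*overallRr);   % prints: Overall RR = 93.33%
```

The overall rate is computed over the pooled utterances (84 of 90), which here equals the average of the per-speaker rates only because every speaker has the same number of recordings.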
Data Clustering and Pattern Recognition (資料分群與樣式辨認)